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Abstract 

We present a general method for deriving collapsed variational inference algo- 
rithms for probabilistic models in the conjugate exponential family. Our method 
unifies many existing approaches to collapsed variational inference. Our collapsed 
variational inference leads to a new lower bound on the marginal likelihood. We 
exploit the information geometry of the bound to derive much faster optimization 
methods based on conjugate gradients for these models. Our approach is very 
general and is easily applied to any model where the mean field update equations 
have been derived. Empirically we show significant speed-ups for probabilistic 
inference using our bound. 

1 Introduction 

Variational bounds provide a convenient approach to approximate inference in a range 
of intractable models. Classical variational optimisation is achieved through coordi- 
nate ascent which can be slow to converge. A popular solution [King and Lawrence, 
2006, Teh et al., 2007, Kurihara et al., 2007, Sung et al., 2008, Lazaro-Gredilla and 
Titsias, 2011, Lazaro-Gredilla et al., 2011] is to marginalize analytically a portion of 
the variational approximating distribution, removing this from the optimization. In this 
paper we provide a unifying framework for collapsed inference in the general class of 
models composed of conjugate-exponential graphs (CEGs). 

First we review the body of earlier work with a succinct and unifying derivation of 
the collapsed bounds. We describe how the applicability of the collapsed bound to any 
particular CEG can be determined with a simple d-separation test. Standard variational 
inference via coordinate ascent turns out to be steepest ascent with a unit step length 
on our unifying bound. This motivates us to consider natural gradients and conjugate 
gradients for fast optimization of these models. We apply our unifying approach to 
a range of models from the literature obtaining, often, an order of magnitude or more 
increase in convergence speed. Our unifying view allows collapsed variational methods 
to be integrated into general inference tools like infer.net [Minka et al., 2010]. 
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2 The Marginalised Variational Bound 

The advantages to marginalising analytically a subset of variables in variational bounds 
seem to be well understood: several different approaches have been suggested in the 
context of specific models. In Dirichlet process mixture models Kurihara et al. [2007] 
proposed a collapsed approach using both truncated stick-breaking and symmetric pri- 
ors. Sung et al. [2008] proposed 'latent space variational Bayes' where both the cluster- 
parameters and mixing weights were marginalised, again with some approximations. 
Teh et al. [2007] proposed a collapsed inference procedure for latent Dirichlet alloca- 
tion (LDA). In this paper we unify all these results from the perspective of the 'KL 
corrected bound' [King and Lawrence, 2006]. This lower bound on the model evi- 
dence is also an upper bound on the original variational bound, the difference between 
the two bounds is given by a Kullback Leibler divergence. The approach has also been 
referred to as the marginalised variational bound by Lazaro-Gredilla et al. [2011], 
Lazaro-Gredilla and Titsias [2011]. The connection between the KL corrected bound 
and the collapsed bounds is not immediately obvious. The key difference between the 
frameworks is the order in which the marginalisation and variational approximation are 
applied. However, for CEGs this order turns out to be irrelevant. Our framework leads 
to a more succinct derivation of the collapsed approximations. The resulting bound can 
then be optimised without recourse to approximations in either the bound's evaluation 
or its optimization. 

2.1 Variational Inference 

Assume we have a probabilistic model for data, T>, given parameters (and/or latent 
variables), X, Z, of the form p(V, X, Z) = p(V | Z, X)p(Z | X)p(X). In variational 
Bayes (see e.g. Bishop [2006]) we approximate the posterior p(Z,X|2?) by a dis- 
tribution q(Z, X). We use Jensen's inequality to derive a lower bound on the model 
evidence C, which serves as an objective function in the variational optimisation: 



For tractability the mean field (MF) approach assumes q factorises across its variables, 
q(Z, X) = q(Z)q(X.). It is then possible to implement an optimisation scheme which 
analytically optimises each factor alternately, with the optimal distribution given by 



and similarly for Z: these are often referred to as VBE and VBM steps. King and 
Lawrence [2006] substituted the expression for the optimal distribution (for example 
g*(X)) back into the bound (1), eliminating one set of parameters from the optimisa- 
tion, an approach that has been reused by Lazaro-Gredilla et al. [201 1], Lazaro-Gredilla 
and Titsias [201 1]. The resulting bound is not dependent on q(X). King and Lawrence 
[2006] referred to this new bound as 'the KL corrected bound'. The difference between 
the bound, which we denote £ KL , and a standard mean field approximation £ M f, is the 
Kullback Leibler divergence between the optimal form of q* (X) and the current g(X). 




(1) 




(2) 
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We rederive their bound by first using Jensen's inequality to construct the varia- 
tional lower bound on the conditional distribution, 

lnp(P|X) > / g(Z) In P( ^' ( 5 X) dZ 4 A. (3) 

This object turns out to be of central importance in computing the final KL-corrected 
bound and also in computing gradients, curvatures and the distribution of the collapsed 
variables q*(X). It is easy to see that it is a function of X which lower-bounds the log 
likelihood p(D | X), and indeed our derivation treats it as such. We now marginalize 
the conditioned variable from this expression, 

lnp(V) >\nj p(X) exp{A} dX 4 £ KL , (4) 

giving us the bound of King and Lawrence [2006] & Lazaro-Gredilla et al. [2011]. 
Note that one set of parameters was marginalised after the variational approximation 
was made. 

Using (2), this expression also provides the approximate posterior for the marginalised 
variables X: 

g*(X) =p(X)e £l - £KL (5) 

and e £KL appears as the constant of proportionality in the mean-field update equation 
(2). 



3 Partial Equivalence of the Bounds 

We can recover £ M f from £ KL by again applying Jensen's inequality, 

Ckl =lnj exp{A}dX > J q(X) In j^j exp{A}| dX, (6) 

which can be re-arranged to give the mean-field bound, 

and it follows that £ KL = £ M f + KL(<7*(X)||(/(X)) and 1 £ K l > jCmf- F° r a given 
q(Z), the bounds are equal after g(X) is updated via the mean field method: the ap- 
proximations are ultimately the same. The advantage of the new bound is to reduce 
the number of parameters in the optimisation. It is particularly useful when variational 
parameters are optimised by gradient methods. Since VBEM is equivalent to a steep- 
est descent gradient method with a fixed step size, there appears to be a lot to gain by 
combining the KLC bound with more sophisticated optimization techniques. 

'We use KL(-||-) to denote the Kullback Leibler divergence between two distributions. 
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3.1 Gradients 



Consider the gradient of the KL corrected bound with respect to the parameters of 
q(Z): 



(8) 



^ = ° XP{ ~ £kl} ^ / cx P{ £ i W X ) dX = E o*(x) [f| 
where we have used the relation (5). To find the gradient of the mean-field bound we 
note that it can be written in terms of our conditional bound (3) as £ M f = ^q(X) d + lnp(X) — lng(X) 
giving 

(9) 



-^-- E «(X) 



■dO z 

thus setting <?(X) = <z*(X) not only makes the bounds equal, £ MF = £kl, but also 
their gradients with respect to 9z- 

Sato [2001] has shown that the variational update equation can be interpreted as 
a gradient method, where each update is also a step in the steepest direction in the 
canonical parameters of q(Z). We can combine this important insight with the above 
result to realize that we have a simple method for computing the gradients of the KL 
corrected bound: we only need to look at the update expressions for the mean-field 
method. This result also reveals the weakness of standard variational Bayesian expec- 
tation maximization (VBEM): it is a steepest ascent algorithm. Honkela et al. [2010] 
looked to rectify this weakness by applying a conjugate gradient algorithm to the mean 
field bound. However, they didn't obtain a significant improvement in convergence 
speed. Our suggestion is to apply conjugate gradients to the KLC bound. Whilst the 
value and gradient of the MF bound matches that of the KLC bound after an update 
of the collapsed variables, the curvature is always greater. In practise this means that 
much larger steps (which we compute using conjugate gradient methods) can be taken 
when optimizing the KLC bound than for the MF bound leading to more rapid conver- 
gence. 



3.2 Curvature of the Bounds 



King and Lawrence [2006] showed empirically that the KLC bound could lead to faster 
convergence because the bounds differ in their curvature: the curvature of the KLC 
bound enables larger steps to be taken by an optimizer. We now derive analytical 
expressions for the curvature of both bounds. For the mean field bound we have 

O^JZmf ^ \ d 2 JZ\' 



del 



E 



9(X) 



del 



(10) 



and for the KLC bound, with some manipulation of (4) and using (5): 
<9 2 £kl d 2 e c ^ _ 9r ,„ r<9e £KL w o>e £K 



defde^ 
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In this result the first term is equal to (10), and the second two terms combine to be 
always positive semi-definite, proving King and Lawrence [2006]'s intuition about the 
curvature of the bound. When curvature is negative definite (e.g. near a maximum), 
the KLC bound's curvature is less negative definite, enabling larger steps to be taken in 
optimization. Figure 1(b) illustrates the effect of this as well as the bound's similarities. 



3.3 Relationship to Collapsed VB 

In collapsed inference some parameters are marginalized before applying the varia- 
tional bound. For example, Sung et al. [2008] proposed a latent variable model where 
the model parameters were marginalised, and Teh et al. [2007] proposed a nonpara- 
metric topic model where the document proportions were collapsed. These procedures 
lead to improved inference, or faster convergence. 

The KLC bound derivation we have provided also marginalises parameters, but af- 
ter a variational approximation is made. The difference between the two approaches is 
distilled in these expressions: 



InE, 



'(X) 



cxp E 9(z) [lnp(2?|X,Z) 



E 



9(Z) 



In \ Ep (X ) [p(P\X, Z) 



(12) 



where the left expression appears in the KLC bound, and the right expression appears 
in the bound for collapsed variational Bayes, with the remainder of the bounds being 
equal. Whilst appropriately conjugate formulation of the model will always ensure 
that the KLC expression is analytically tractable, the expectation in the collapsed VB 
expression is not. Sung et al. [2008] propose a first order approximation to the expec- 
tation of the form E q (z) /(Z) ~ /(E g (z) Z ), which reduces the right expression to 

the that on the left. Under this approximation 2 the KL corrected approach is equivalent 
to the collapsed variational approach. 



3.4 Applicability 

To apply the KLC bound we need to specify a subset, X, of variables to marginalize. 
We select the variables that break the dependency structure of the graph to enable 
the analytic computation of the integral in (4). Assuming the appropriate conjugate 
exponential structure for the model we are left with the requirement to select a sub-set 
that induces the appropriate factorisation. These induced factorisations are discussed 
in some detail in Bishop [2006]. They are factorisations in the approximate posterior 
which arise from the form of the variational approximation and from the structure of 
the model. These factorisations allow application of KLC bound, and can be identified 
using a simple d-separation test as Bishop discusses. 

The d-separation test involves checking for independence amongst the marginalised 
variables (X in the above) conditioned on the observed data V and the approximated 

2 Kurihara et al. [2007] and Teh et al. [2007] suggest a further second order correction and assume that 
that q(Z) is Gaussian to obtain tractability. This leads to additional correction terms that augment KLC 
bound. The form of these corrections would need to be determined on a case by case basis, and has in fact 
been shown to be less effective than those methods unified here [Asuncion et al., 2012]. 
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Figure 1: (a) An example directed graphical model on which we could use the KLC 
bound. Given the observed node C, the nodes A,F d-separate given nodes B,D,E. Thus 
we could make an explicit variational approximation for A,F, whilst marginalising 
B,D,E. Alternatively, we could select B,D,E for a parameterised approximate distri- 
bution, whilst marginalising A,F. (b) A sketch of the KLC and MF bounds. At the 
point where the mean field method has g(X) = <z*(X), the bounds are equal in value 
as well as in gradient. Away from the this point, the different between the bounds is 
the Kullback Leibler divergence between the current MF approximation for X and the 
implicit distribution q* (X) of the KLC bound. 

variables (Z in the above). The requirement is to select a sufficient set of variables, Z, 
such that the effective likelihood for X, given by (3) becomes conjugate to the prior. 
Figure 1(a) illustrates the d-separation test with application to the KLC bound. 

For latent variable models, it is often sufficient to select the latent variables for 
X whilst collapsing the model variables. For example, in the specific case of mixture 
models and topic models, approximating the component labels allows for the marginal- 
isation of the cluster parameters (topics allocations) and mixing proportions. This al- 
lowed Sung et al. [2008] to derive a general form for latent variable models, though 
our formulation is general to any conjugate exponential graph. 

4 Riemannian Gradient Based Optimisation 

Sato [2001] and Hoffman et al. [2012] showed that the VBEM procedure performs gra- 
dient ascent in the space of the natural parameters. Using the KLC bound to collapse 
the problem, gradient methods seem a natural choice for optimisation, since there are 
fewer parameters to deal with, and we have shown that computation of the gradients is 
straightforward (the variational update equations contain the model gradients). It turns 
out that the KLC bound is particularly amenable to Riemannian or natural gradient 
methods, because the information geometry of the exponential family distrubution(s), 
over which we are optimising, leads to a simple expression for the natural gradient. 
Previous investigations of natural gradients for variational Bayes [Honkela et al., 2010, 
Kuusela et al., 2009] required the inversion of the Fisher information at every step 
(ours does not), and also used VBEM steps for some parameters and Riemannian opti- 
misation for other variables. The collapsed nature of the KLC bound means that these 
VBEM steps are unnecessary: the bound can be computed by parameterizing the dis- 
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tribution of only one set of variables (q(Z)) whilst the implicit distribution of the other 
variables is given in terms of the first distribution and the data by equation (5). 

We optimize the lower bound £ KL with respect to the parameters of the approx- 
imating distribution of the non-collapsed variables. We showed in section 2 that the 
gradient of the KLC bound is given by the gradient of the standard MF variational 
bound, after an update of the collapsed variables. It is clear from their definition that 
the same is true of the natural gradients. 



4.1 Variable Transformations 

We can compute the natural gradient of our collapsed bound by considering the update 
equations of the non-collapsed problem as described above. However, if we wish to 
make use of more powerful optimisation methods like conjugate gradient ascent, it 
is helpful to re-parameterize the natural parameters in an unconstrained fashion. The 
natural gradient is given by [Amari and Nagaoka, 2007]: 

g(0) = G(er ld ^ (13) 

where G(9) is the Fisher information matrix whose i j* element is given by 

-d 2 \nq(X\0)- 



G ( e )[t,j] = - E 9(X|0) 



d0 [i] dO [j] 



(14) 



For exponential family distributions, this reduces to ygtp(O), where ip is the log- 
normaliser. Further, for exponential family distributions, the Fisher information in the 
canonical parameters (9) and that in the expectation parameters (rj) are reciprocal, and 
we also have G{6) — drj/dO. This means that the natural gradient in is given by 

gW=G(0) de^r^^r and «^ = -flr- (15) 

The gradient in one set of parameters provides the natural gradient in the other. Thus 
when our approximating distribution q is exponential family, we can compute the nat- 
ural gradient without the expensive matrix inverse. 



4.2 Steepest Ascent is Coordinate Ascent 

Sato [2001] showed that the VBEM algorithm was a gradient based algorithm. In 
fact, VBEM consists of taking unit steps in the direction of the natural gradient of the 
canonical parameters. From equation (9) and the work of Sato [2001], we see that 
the gradient of the KLC bound can be obtained by considering the standard mean- 
field update for the non-collapsed parameter Z. We confirm these relationships for the 
models studied in the next section in the supplementary material. 

Having confirmed that the VB-E step is equivalent to steepest-gradient ascent we 
now explore whether the procedure could be improved by the use of conjugate gradi- 
ents. 
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4.3 Conjugate Gradient Optimization 

One idea for solving some of the problems associated with steepest ascent is to ensure 
each gradient step is conjugate (geometrically) to the previous. Honkela et al. [2010] 
applied conjugate gradients to the standard mean field bound, we expect much faster 
convergence for the KLC bound due to its differing curvature. Since VBEM uses a step 
length of 1 to optimize, 3 we also used this step length in conjugate gradients. In the 
natural conjugate gradient method, the search direction at the I th iteration is given by 
Si = — gi + pSi-i. Empirically the Fletcher-Reeves method for estimating f3 worked 
well for us: 

a (gi)gi)j (MZ\ 

PFR = 7= ~ r (16) 

\gi-l)gi-l/»-l 

where (-, -)j denotes the inner product in Riemannian geometry, which is given by 
g T G(p)g. We note from Kuuselaetal. [2009] that this can be simplified since g T Gg = 
g T GG _1 g = g T g, and other conjugate methods, defined in the supplementary mate- 
rial, can be applied similarly. 

5 Experiments 

For empirical investigation of the potential speed ups we selected a range of probabilis- 
tic models. We provide derivations of the bound and fuller explanations of the models 
in the supplementary material. In each experiment, the algorithm was considered to 
have converged when the change in the bound or the Riemannian gradient reached be- 
low 10 -6 . Comparisons between optimisation procedures always used the same initial 
conditions (or set of initial conditions) for each method. First we recreate the mixture 
of Gaussians example described by Honkela et al. [2010]. 

5.1 Mixtures of Gaussians 

For a mixture of Gaussians, using the d-separation rule, we select for X the cluster 
allocation (latent) variables. These are parameterised through the softmax function for 
unconstrained optimisation. Our model includes a fully Bayesian treatment of the clus- 
ter parameters and the mixing proportions, whose approximate posterior distributions 
appear as (5). Full details of the algorithm derivation are given in the supplementary 
material. A neat feature is that we can make use of the discussion above to derive an 
expression for the natural gradient without a matrix inverse. 

In Honkela et al. [2010] data are drawn from a mixture of five two-dimensional 
Gaussians with equal weights, each with unit spherical covariance. The centers of 
the components are at (0,0) and (±R,±R). R is varied from 1 (almost completely 
overlapping) to 5 (completely separate). The model is initialised with eight components 
with an uninformative prior over the mixing proportions: the optimisation procedure is 
left to select an appropriate number of components. 

3 We empirically evaluated a line-search procedure, but found that in most cases that Wolfe-Powell con- 
ditions were met after a single step of unit length. 
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Table 1: Iterations to convergence for the mixture of Gaussians problem, with varying overlap 
(R). This table reports the average number of iterations taken to reach (within 10 nats of) the best 
known solution. For the more difficult scenarios (with more overlap in the clusters) the VBEM 
method failed to reach the optimum solution within 500 restarts 



CG. method 


R = 1 


R = 2 


R = 3 


R = A 


R = 5 


Polack-Ribiere 


3, 100.37 


15,698.57 


5,767.12 


1,613.09 


3, 046.25 


Hestenes-Stiefel 


1,371.55 


5,501.25 


5,922.4 


358.03 


172.39 


Fletcher-Reeves 


416.18 


1,161.35 


5,091.0 


792.10 


494.24 


VBEM 


oo 


oo 


oo 


992.07 


429.57 



Sung et al. [2008] reported that their collapsed method led to improved conver- 
gence over VBEM. Since our objective is identical, though our optimisation procedure 
different, we devised a metric for measuring the efficacy of our algorithms which also 
accounts for their propensity to fall into local minima. Using many randomised restarts, 
we measured the average number of iterations taken to reach the best-known optimum. 
If the algorithm converged at a lesser optimum, those iterations were included in the 
denomiator, but we didn't increment the numerator when computing the average. We 
compared three different conjugate gradient approaches and standard VBEM (which is 
also steepest ascent on the KLC bound) using 500 restarts. 

Table 1 shows the number of iterations required (on average) to come within 10 nats 
of the best known solution for three different conjugate-gradient methods and VBEM. 
VBEM sometimes failed to find the optimum in any of the 500 restarts. Even relaxing 
the stringency of our selection to 100 nats, the VBEM method was always at least twice 
as slow as the best conjugate method. 

5.2 Topic Models 

Latent Dirichlet allocation (LDA) [Blei et al., 2003] is a popular approach for extract- 
ing topics from documents. To demonstrate the KLC bound we applied it to 200 papers 
from the 2011 NIPS conference. The PDFs were preprocessed with pdftotext, re- 
moving non-alphabetical characters and coarsely filtering words by popularity to form 
a vocabulary size of 2000. 4 We selected the latent topic-assignment variables for pa- 
rameterisation, collapsing the topics and the document proportions. Conjugate gradient 
optimization was compared to the standard VBEM approach. 

We used twelve random initializations, starting each algorithm from each initial 
condition. Topic and document distributions where treated with fixed, uninformative 
priors. On average, the Hestenes-Steifel algorithm was almost ten times as fast as stan- 
dard VB, as shown in Table 2, whilst the final bound varied little between approaches. 

4 Some extracted topics are presented in the supplementary material. 
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Table 2: Time and iterations taken to run LDA on the NIPS 2011 corpus, ± one standard 
deviation, for two conjugate methods and VBEM. The Fletcher-Reeves conjugate algorithm is 
almost ten times as fast as VBEM. The value of the bound at the optimum was largely the same: 
deviations are likely just due to the choice of initialisations, of which we used 12. 



Method 


Time (minutes) 


Iterations 


Bound 


Hestenes-Steifel 
Fletcher-Reeves 
VBEM 


56.4 ± 18.5 

38.5 ± 8.7 
370 ± 105 


644.3 ± 214.5 
447.8 ± 100.5 

4, 459 ± 1,296 


-1,998, 780 ±201 
-1, 998, 743 ± 194 
-1,998, 732 ±241 



5.3 RNA-seq alignment 

An emerging problem in computational biology is inference of transcript structure and 
expression levels using next-generation sequencing technology (RNA-Seq). Several 
models have been proposed. The BitSeq method [Glaus et al., 2012] is based on a 
probabilistic model and uses Gibbs sampling for approximate inference. The sampler 
can suffer from particularly slow convergence due to the large size of the problem, 
which has six million latent variables for the data considered here. We implemented 
a variational version of their model and optimised it using VBEM and our collapsed 
Riemannian method. We applied the model to data described in Xu et al. [2010], a 
study of human microRNA. The model was initialised using four random initial con- 
ditions, and optimised using standard VBEM and the conjugate gradient versions of 
the algorithm. The Polack-Ribiere conjugate method performed very poorly for this 
problem, often giving negative conjugation: we omit it here. The solutions found for 
the other algorithms were all fairly close, with bounds coming within 60 nats. The 
VBEM method was dramatically outperformed by the Fletcher-Reeves and Hestenes- 
Steifel methods: it took 4600 ± 20 iterations to converge, whilst the conjugate methods 
took only 268 ± 4 and 265 ± 1 iterations to converge. At about 8 seconds per iteration, 
our collapsed Riemannian method requires around forty minutes, whilst VBEM takes 
almost eleven hours. All the variational approaches represent an improvement over a 
Gibbs sampler, which takes approximately one week to run for this data [Glaus et al., 
2012]. 

6 Discussion 

Under very general conditions (conjugate exponential family) we have shown the equiv- 
alence of collapsed variational bounds and marginalized variational bounds using the 
KL corrected perspective of King and Lawrence [2006]. We have provided a succinct 
derivation of these bounds, unifying several strands of work and laying the foundations 
for much wider application of this approach. 

When the collapsed variables are updated in the standard MF bound the KLC bound 
is identical to the MF bound in value and gradient. Sato [2001] has shown that coordi- 
nate ascent of the MF bound (as proscribed by VBEM updates) is equivalent to steepest 
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ascent of the MF bound using natural gradients. This implies that standard variational 
inference is also performing steepest ascent on the KLC bound. This equivalence be- 
tween natural gradients and the VBEM update equations means our method is quickly 
implementable for any model where the mean field update equations have been com- 
puted. It is only necessary to determine which variables to collapse using a d-separation 
test. Importantly this implies our approach can readily be incorporated in automated 
inference engines such as that provided by infer.net [Minka et al., 2010]. We'd like to 
emphasise the ease with which the method can be applied: we have provided deriva- 
tions of equivalencies of the bounds and gradients which should enable collapsed con- 
jugate optimisation of any existing mean field algorithm, with minimal changes to the 
software. Indeed our own implementations (see supplementary material) use just a few 
lines of code to switch between the VBEM and conjugate methods. 

The improved performance arises from the curvature of the KLC bound. We have 
shown that it is always less negative than that of the original variational bound al- 
lowing much larger steps in the variational parameters as King and Lawrence [2006] 
suggested. This also provides a gateway to second-order optimisation, which could 
prove even faster. 

We provided empirical evidence of the performance increases that are possible us- 
ing our method in three models. In a thorough exploration of the convergence prop- 
erties of a mixture of Gaussians model, we concluded that a conjugate Riemannian 
algorithm can find solutions that are not found with standard VBEM. In a large LDA 
model, we found that performance can be improved many times over that of the VBEM 
method. In the BitSeq model for differential expression of genes transcripts we showed 
that very large improvements in performance are possible for models with huge num- 
bers of latent variables. 
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Supplementary Material for: 

Fast Variational Inference 
in the Conjugate Exponential Family 

This supplementary material accompanies the NIPS paper on Fast Variational Inference 
in the Conjugate Exponential Family. Its purpose is to provide details of how our very 
general framework applies in the case of the specific models described in the paper. 
First we briefly mention the form of the three conjugate gradient algorithms we used 
in optimization. 



7 Conjugate gradient algorithms 

There are several different methods for approximating the parameter (3 in the conjugate 
gradient algorithm. We used the Polack-Ribiere, Fletch-Reeves or Hestenes-Stiefel 
methods: 

a (SijSi — Si— l)i 

PPR ~ 



(g»-l,gi-l)»-l 

(gi> gi)i 
(gi— 1> gi— l)i- 
(gii Si Si— l)i 



a (Si)Si)i /i-tn, 

(gi-l,gi-l)i-l 



(Si-l)Si - Si-l)i-l 

where {■,■)% denotes the inner product in Riemannian geometry, which is given by 
S T G(p)g 



8 Mixture of Gaussians 

A MoG model is defined as follows. We have a set of N D-dimensional vectors Y = 
{y n }n=i- The likelihood is 

K N 

P (Yto,L) = n n^y-i/^v)^ w 

k=l n=l 

where L is a collection of binary latent variables indicating cluster membership, L = 
{{Aifc}^=i}jfcLi anc l V i s a collection of cluster parameters, r\ = {/j, k , A fe }^ 1 

The prior over L is given by a multinomial distribution with components tv, which 
in turn have a Dirichlet prior with uniform concentrations for simplicity: 

K N K 

p(Li7r)=nrK n *' p(T)=^(a)n<" 1 

fe=ln=l fe=l 

with a. representing a K dimensional vector with elements a, and Re, being the nor- 
malising constant for the Dirichlet distribution, Rd(ol) = T(Ka)T(a)^ K . 
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Figure 2: A graphical model representation of the MoG model. A d-separation test 
quickly shows that it is possible to marginalise tv and 77 given a variational parameter- 
is ation of L. 

Finally we choose a conjugate Gaussian-Wishart prior for the cluster parameters 
which can be written 

InpOfc, A k) = \nR GW (S ,v ,n ) + V ° ° In |A fc | 

1 2 (20) 

--tr (A fe (KoMfeAtfe + KomomJ - 2K m fiJ + S )) 

where Row is the normalising constant, and is given by 

D 

R GW (S,v, k) - ISI^-^tt-^^kt (J] r((!/ + 1 - tO/2))- 1 . 

8.1 Applying the KLC bound 

The first task in applying the KLC bound is to select which variables to parameterise 
and which to marginalise. From the graphical model representation of the MoG prob- 
lem in Figure 2, we can see that we can select the latent variables Z = {L} for 
parameterisation, whilst marginalising the mixing proportions and cluster parameters 
(X = {71-, rj}). We note that it is possible to select the variables the other way around: 
parameterising tt and r] and marginalising L, but parameterisation of the latent vari- 
ables makes implementation a little simpler. 

We use a factorised multinomial distribution q(L) to approximate the posterior for 
p(L|Y), parameterised using the softmax functions so 

N K p k 

^=1111^' r nk = ^ . (21) 

n=l fe=l 2^i=l e "* 

We are now ready to apply the procedure described above to derive the KLC bound. 
First, 

lnp(Y|77, tt) > / q(L){ln P (Y\ V , L) + ln^L^)} dL + H L , (22) 



15 



where Hl is the entropy of the distribution q(L). We expand to give 

K 

£1 = \ Y] { - tr(A fe (f fc /x fe /x^ + C fe - 2/x fc y fe )) 

fc=i L (23) 

+f fc lnTTfe + fit In |A fc |} +H L -^f ln(2n) 

where r k = J2n=i r nk, C k = Z^ =1 r nfe y n yT, and y k = I]^ =1 r„ fe y„. The con- 
jugacy between the intermediate bound C\ and the prior now emerges, making the 
second integral in the KLC bound tractable. 

After exponentiating this expression and multiplying by the prior, p{rf)p{iz), we 
find that the integrals with respect to both rj and 7r are tractable. This result means that 
the only variational parameters needed are those of q(L). The integrals result in 

ND 

Acl = — ln(27r) + lni? Dl (a) - In R m (a') 

k (24) 
+inni? GH /(So,^o,K ) - ^2\nR GW (S k , v k ,n k ) + H L 

fe=i 

where we have defined 

a k = a + r k K k = k + r k 

m fe = (K m + y k )/K k v k = v^ + r k (25) 

S fc = S + C k + K m iiio - K k m k ml 
and a' represents a vector containing each a k . Some simplification of (24) leads to 

K 



£kl - ]T { lni> fe ) - f In n k - In \S k \ 

k t (26) 
+ ln r (K + 1 - d )/ 2 ) } + ff i + const 



d=i 

where const, contains terms independent of r. 

Equations (25) are similar to the update equations for the approximating distribu- 
tions in the VBEM methodology [see e.g. Bishop, 2006]. However, for our model they 
are simply intermediate variables, representing combinations of the true variational pa- 
rameters r, the data, and the model prior parameters. When optimizing the model with 
respect to the variational parameters, the dependency of these intermediate variables 
on r is not ignored as it would be in MF variational approach. 

The gradient of the MV bound (26) with respect to the parameters r is given by 



dC KL 

dr nk 



y^ 1 - i In | Sfc | +tp{a k ) -lnr nfe 



- f (y n -m fc ) T 5 fc 1 (y„-m fc )) (2?) 

D 

+ § 5>(K + 1 - <*)/2) - 1. 
d=i 
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Figure 3: A graphical model representation of Latent Dirichlet allocation. A d- 
separation test quickly shows that it is possible to marginalise 8 and <j> given a vari- 
ational parameterisation of L. 



Taking a step in this direction (in the valiables 7) yields exactly the VB-E step associ- 
ated with the mean-field bound, the gradient in r is the natural gradient in 7 (see paper 
section 4.1). 



9 Latent Dirichlet Allocation 

Latent Dirichlet allocation is a popular topic model. See Blei et al. [2003] for a thor- 
ough introduction. 

Suppose we have D documents, K topics and a vocabulary of size V. The d th 
document contains N d words W d = {wd n }n=i> an d eacn w °rd is represented as a 
binary vector Wdn € {0, 1} V . Each word is associated with a latent variable £d n , 
which assigns the word to a topic, thus £ dn € {0, 1} K . We'll use W to represent the 
colletion of all words, W — {W d } d=1 , and L to represent the collection of all latent 
variables L = {{£ dn }^ 1 }£ =1 . 

Each document has an associated vector of topic proportions, 9d € [0, 1] K , and 
each topic is represented by a vector of word proportions <f>k € [0, l] v . We assume a 
symmetrical prior distribution over topics in each document p{9 d ) = Dir(9 d \a), and 
similarly for words within topics. p{(j>k) = Dir((f> k \f3). 

The LDA generative model states that for each word, first the associated topic is 
drawn from the topic proportions for the document, and then the word is drawn from 
the selected topic. 



K 

P (e dn \e d ) = l[e e d r 



*;=i 

K V 



(28) 



k=l v=l 



9.1 The collapsed bound 

To derive the colapsed bound, we use a similar d-separation test as for the mixture 
model to select the latent variables as the parameteriser (non-collapsed) nodes. See 
Figure 3. 
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To proceed we assume a factorising multinomial posterior for L: 

D N d K 



d=l n=l k=l 



subject to the constraint Y^k=i ^ dnk = 1> which we enforce through a softmax repa- 
rameterisation 

r*,* = — k — • (30) 

E*'=i ePd - fc 

We proceed by deriving the conditional bound 

D N d K V D N d K 

lnp(W\6,(p) >£i = ^2^2^2^2(w dnv r dnk )ln(j) kv +^2^2^2(r dnk \n9 dk )+H[q(L)}. 

(j=lu=lfe=l»=l d=ln=lfe=l 

(31) 

To marginalise the variables 8, <f>, we exponentiate this bound and take the expecta- 
tion under the priors. This results in 



k v D K 

(Enil^n*) 

dk 



P (w) > f cxp{c 1 }p(e) P (cb) d^jjin <pe Li ^ (wd - r ^ ] n n °. 

J J fe=lw=l d=lfe=l 

fe=i «=i 

_D if 
d=l fe=l 

exp{tf[g(L)]}. 

(32) 

Careful inspection of the above reveals that the two integrals separate as expected, 
and result in the normalizers for each of the independent Dirichlet approximations. 
Taking the logarithm reults in 

D K 

£kl = D In R m (a) - ^ In R Dl (a' d ) + Kln R Dl (/3) - ^ In R Dl (fi' k ) + H[q(L)] 

d=l fe=l 

(33) 

where we have defined a' dk = a + Yln^i r dnk and P' kv = f3 + Y? d =i Enli w dnv r dnk . 

9.2 Topics found by LDA 

For completeness we show here some topics found by LDA on the NIPS conference 
data. 
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Table 3: some topics found using LDA on papers from the 201 1 NIPS conference. 



neural 


training 


distribution 


input 


feature 


gaussian 


neurons 


classification 


inference 


network 


class 


process 


fig 


tree 


prior 


estimate 


prediction 


sampling 


neuron 


label 


likelihood 


visual 


accuracy 


posterior 


nonlinear 


labels 


distributions 


linear 


classifier 


bayesian 



data 


features 


model 


points 


image 


models 


point 


object 


variables 


clustering 


images 


parameters 


distance 


objects 


structure 


dataset 


scene 


variable 


similarity 


recognition 


markov 


cluster 


reference 


observed 


manifold 


detection 


graphical 


spectral 


part 


hidden 



10 BitSeq Model 

The generative model for an RNA-seq assay is as follows. We assume that the ex- 
periment consists of a pile of RNA fragments, where the abundance of fragments from 
transcript T m in the assay is 8 m . The sequencer then selects a fragment at random from 
the pile, such that the probability of picking a fragment corresponding to transcript T rn 
is 6 m . Introducing a convenient membership vector l n for each read, we can write 

N M 

pirn = n n (34) 

n— 1 ra— 1 

where £ nm € {0, 1} is a binary variable which indicates whether the nth fragment 
came from the mth transcript (£ nm = 1) and is subject to J2 m =i ^ nm = We use ^ 
to represent the collection of all alignment variables. 

Both 6 and L are variables to be inferred, with 9 the main object of interest. 

Writing the collection of all reads as R = {r„}^ =1 , the likelihood of a set of 
alignments L is 

JV 

p(R|T,L)= Y[p(r n \T m ) tnm (35) 

n=l 

where T m represents the mth transcript, T represents the transcriptome. 

The values of p(r n \T m ) can be computed before performing inference in 8 since 
we are assuming a known transcriptome. We compute these values based on the quality 
of alignment of the read r„ to the transcript T m , using a model which can correct for 
sequence specific or fragmentation biases. The method is described in detatil in Glaus 
etal. [2012]. 

We specify a conjugate Dirichlet prior over the vector 0. 

llm=l 1 \ a m) m =l 
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Figure 4: A graphical model representation of the BitSeq model. A d-separation test 
quickly shows that it is possible to marginalise 9 given a variational parameterisation 
ofL. 



with a° = X)m=i °°m- a m represents our prior belief in the values of 6 m , and we use a 
relatively uninformative but proper prior a° m = 1 Vm = 1 . . . M. A priori, we assume 
that the concentrations are all equal, but with large uncertainty. 



10.1 The collapsed bound 

Figure 4 shows a graphical representation of the BitSeq model. It's clear that param- 
eterisation of the latent variables will allow us to collapse 9, or vica-versa. Selecting 
again the latent variables for parameterisation, X = {L}, Z = {9}, we first find the 
conditional bound as usual by: 



In p(R | T, 0) = In [p(R\L, T)p(L | 0) dL 



> E, (L) [lnp(R | L, T) + Inp(L | 6) - In g(L) 

N M 



(37) 



> ^nm(lnp(r„ I T m ) + \n9 m - ln£ nm ) 

n—1 rn — 1 

> A 

It's clear that this bound is conjugate to the prior for 9, so we can marginalise: 

JV M 

\np(R \T)>£ KL = Y,Y, £ ™ ( ln P( r « I T ™) _ ln <-) + In r(d°) - In T(a° + N) 



n—1 rn—1 



M 



(inr«)-inr« + i m )) 

m— 1 

(38) 



where l m = J2n=i ^™ anc ' we a ^ so nave tnat tne approximate posterior distribution for 
6 is a Dirichlet distribution with parameters a° m + i m . 
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